Support get/set the whole row of metaheader+weight+optimizer from backend for checkpoint saving/loading #4429

Closed

bobbyliujb wants to merge 1 commit into pytorch:main from bobbyliujb:export-D77604158

bobbyliujb commented Jul 1, 2025

Summary:

Context

In the current KVZCH checkpoint loading flow, we hold the weight_id, weight, and optimizer tensors for the entire checkpoint loading lifecycle. Only once all of these tensors have been downloaded do we explicitly call "apply_state_dict" to write them to the backend chunk by chunk, which ensures that id->weight and id->opt are mapped correctly. The problem is that with a large number of weights we run short of memory, because all 3 tensors must be held at once (the double memory issue). To solve this, we save the whole row of (metaheader + weight + opt) as a single "weight" tensor during checkpoint saving. When downloading the checkpoint, we can then extract the id from the header and write the weight+opt part directly to the backend by id. For loading the optimizer checkpoint, we added a no-op KVTensor so that the optimizer states do not need to be written to the backend a second time.
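
The whole-row idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the FBGEMM implementation: the 8-byte int64 header, the float32 payload, the `pack_row` and `load_rows` helpers, and the dict standing in for the backend are all hypothetical.

```python
import struct

# Hypothetical layout, for illustration only: an 8-byte int64 id as the
# metaheader, followed by float32 weights, then float32 optimizer state.
HEADER_FMT = "<q"
HEADER_BYTES = struct.calcsize(HEADER_FMT)


def pack_row(row_id, weight, opt):
    """Serialize one whole row as metaheader + weight + opt."""
    body = struct.pack(f"<{len(weight) + len(opt)}f", *weight, *opt)
    return struct.pack(HEADER_FMT, row_id) + body


def load_rows(rows, backend):
    """Write each downloaded row to the backend as soon as it is available.

    The id is read from the header, so no separate weight_id tensor has to
    be held in memory next to the weight and optimizer tensors.
    """
    for row in rows:
        (row_id,) = struct.unpack_from(HEADER_FMT, row, 0)
        # Skip the metaheader; only weight + opt is stored under the id.
        backend[row_id] = bytes(row[HEADER_BYTES:])


backend = {}  # stand-in for the real key-value backend
rows = [pack_row(42, [0.5, 1.5], [0.25]), pack_row(7, [2.0, 3.0], [0.75])]
load_rows(rows, backend)
stored = struct.unpack("<3f", backend[42])  # weight values, then opt value
```

Because each row carries its own id, rows can be applied one at a time and dropped, rather than accumulated until an explicit "apply_state_dict" call at the end.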

This diff

* Added a `backend_return_whole_row` flag to the KVZCH params, with validation ensuring it is only True when opt_offloading is used.
* Added a `read_only_` flag to KVTensorWrapper for use in checkpoint calls. When `read_only_` is True, all write operations on that KVT are no-ops.
* Added a metadata recalculation for the optimizer state dict: because we now return a read-only KVT for the opt state dict, the model store needs to correct the global metadata before creating the save plan for KVZCH opt tensors.
* Updated the DRAM backend and mem pool so they can return the metaheader + weight + optimizer_state together and set them back to the backend (using pointers to skip the metaheader part when writing weight+opt to the backend).
* By default, opt offloading and return-whole-row are both False on trunk, so this should not break existing KVZCH runs.
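
A minimal sketch of the flag validation and the read-only behavior listed above, assuming nothing about the real classes: `KVZCHParamsSketch`, `ReadOnlyKVTensorSketch`, the `opt_offloading` field name, and the `write` method are hypothetical stand-ins, not the FBGEMM or KVTensorWrapper API.

```python
from dataclasses import dataclass


@dataclass
class KVZCHParamsSketch:
    """Hypothetical stand-in for the KVZCH params; not the FBGEMM class."""

    opt_offloading: bool = False
    backend_return_whole_row: bool = False  # off by default, as on trunk

    def __post_init__(self):
        # Whole-row return is only valid together with optimizer offloading.
        if self.backend_return_whole_row and not self.opt_offloading:
            raise ValueError(
                "backend_return_whole_row requires opt_offloading to be enabled"
            )


class ReadOnlyKVTensorSketch:
    """Hypothetical wrapper: writes become no-ops when read_only is True."""

    def __init__(self, backend, read_only=False):
        self.backend = backend
        self.read_only = read_only

    def write(self, row_id, values):
        if self.read_only:
            return  # no-op: this row was already written via the weight path
        self.backend[row_id] = list(values)


backend = {1: [0.1]}
opt_view = ReadOnlyKVTensorSketch(backend, read_only=True)
opt_view.write(1, [9.9])  # silently ignored

default_params = KVZCHParamsSketch()
try:
    KVZCHParamsSketch(backend_return_whole_row=True)
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the optimizer state dict can be loaded through `opt_view` without touching the backend, which mirrors the intent of returning a read-only KVT for checkpoint calls.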

Differential Revision:
D77604158

Privacy Context Container: L1138451

netlify bot commented Jul 1, 2025

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`2940240`
🔍 Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/686c1cf56a0b7c0008ef7eb1
😎 Deploy Preview	https://deploy-preview-4429--pytorch-fbgemm-docs.netlify.app

facebook-github-bot added the cla signed label

Contributor

facebook-github-bot commented Jul 1, 2025

This pull request was exported from Phabricator. Differential Revision: D77604158

facebook-github-bot added the fb-exported label

q10 mentioned this pull request

add ckpt and restore with feature evict metaheader #4342

Open

bobbyliujb force-pushed the export-D77604158 branch from 63336e1 to bfffc90 Compare

July 1, 2025 20:28


bobbyliujb pushed a commit to bobbyliujb/torchrec that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

9ca1b52

…kend for checkpoint saving/loading (meta-pytorch#3148)

Summary:
X-link: pytorch/FBGEMM#4429

X-link: facebookresearch/FBGEMM#1495


Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from bfffc90 to d4566dc Compare

July 1, 2025 22:09

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

d4566dc

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/torchrec that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

90e9932

…kend for checkpoint saving/loading (meta-pytorch#3148)

Summary:
X-link: pytorch/FBGEMM#4429

X-link: facebookresearch/FBGEMM#1495


Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

4ef1117

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from d4566dc to 4ef1117 Compare

July 1, 2025 22:13

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from 4ef1117 to 7776299 Compare

July 1, 2025 22:13

bobbyliujb pushed a commit to bobbyliujb/torchrec that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

bef50f7

…kend for checkpoint saving/loading (meta-pytorch#3148)

Summary:
X-link: pytorch/FBGEMM#4429

X-link: facebookresearch/FBGEMM#1495


Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/torchrec that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

720de57

…kend for checkpoint saving/loading (meta-pytorch#3148)

Summary:
X-link: pytorch/FBGEMM#4429

X-link: facebookresearch/FBGEMM#1495

Pull Request resolved: meta-pytorch#3148

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

8c7a58c

…kend for checkpoint saving/loading (pytorch#4429)

Summary:
Pull Request resolved: pytorch#4429

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch 2 times, most recently from 8c7a58c to 0c5b161 Compare

July 2, 2025 17:37

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

0c5b161

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

# This diff only contains backend change
* Updated the DRAM backend and mem pool so they can return the metaheader + weight + optimizer_state together and set them back to the backend (using pointers to skip the metaheader part when writing weight+opt to the backend).
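
One way to read the "skip the metaheader with a pointer" step is as a fixed byte offset into each memory-pool block. The sketch below illustrates that reading only; `BlockPoolSketch`, its method names, the 8-byte header, and the three float32 payload slots are all assumptions and do not reflect the actual DRAM backend code.

```python
import struct

HEADER_BYTES = 8  # hypothetical metaheader size
DATA_FLOATS = 3   # hypothetical: 2 weight floats + 1 optimizer float
BLOCK_BYTES = HEADER_BYTES + 4 * DATA_FLOATS


class BlockPoolSketch:
    """Toy fixed-size block pool; each block is metaheader + weight + opt."""

    def __init__(self, num_blocks):
        self.buf = bytearray(num_blocks * BLOCK_BYTES)

    def block(self, idx):
        # Zero-copy view of one block inside the pool buffer.
        start = idx * BLOCK_BYTES
        return memoryview(self.buf)[start:start + BLOCK_BYTES]

    def get_whole_row(self, idx):
        # Return metaheader + weight + opt together as one row.
        return bytes(self.block(idx))

    def set_from_whole_row(self, idx, whole_row):
        # Offset past the metaheader on both sides, so only weight + opt is
        # copied and the header already held by the pool stays untouched.
        self.block(idx)[HEADER_BYTES:] = whole_row[HEADER_BYTES:]


pool = BlockPoolSketch(num_blocks=2)
struct.pack_into("<q", pool.buf, BLOCK_BYTES, 11)  # header of block 1
incoming = struct.pack("<q3f", 99, 1.0, 2.0, 0.5)  # row from a checkpoint
pool.set_from_whole_row(1, incoming)
header, w0, w1, opt = struct.unpack("<q3f", pool.get_whole_row(1))
```

Here the incoming row's header value is ignored on write, while the payload lands directly in the block without an intermediate copy of the weight and optimizer parts.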

Reviewed By: emlin

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

c3c4f7c

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Reviewed By: emlin

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from 0c5b161 to a9f117f Compare

July 7, 2025 18:43

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

a9f117f

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

cefb8fe

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from a9f117f to cefb8fe Compare

July 7, 2025 18:44

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

2c3f906

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

3daa067

…kend for checkpoint saving/loading (pytorch#4429)

Summary:

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

7e8f169

…kend for checkpoint saving/loading (pytorch#4429)

Summary:
Pull Request resolved: pytorch#4429

X-link: facebookresearch/FBGEMM#1495

X-link: meta-pytorch/torchrec#3148

Differential Revision: D77604158

bobbyliujb force-pushed the export-D77604158 branch from cefb8fe to 7e8f169 Compare

July 7, 2025 18:47

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

6c3efaf

…kend for checkpoint saving/loading (pytorch#4429)

Privacy Context Container: L1138451

Reviewed By: emlin

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

7672d1d

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb force-pushed the export-D77604158 branch from 7e8f169 to 7672d1d Compare

July 7, 2025 18:58

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

c71e102

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb force-pushed the export-D77604158 branch from 7672d1d to c71e102 Compare

July 7, 2025 19:04

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

13e11ac

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

8a692ff

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb force-pushed the export-D77604158 branch from c71e102 to 8a692ff Compare

July 7, 2025 19:05

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

55ea98f

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

92cdb57

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb force-pushed the export-D77604158 branch from 8a692ff to 9dcd607 Compare

July 7, 2025 19:10

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

9dcd607

…kend for checkpoint saving/loading (pytorch#4429)

bobbyliujb pushed a commit to bobbyliujb/FBGEMM-1 that referenced this pull request


          Support get/set the whole row of metaheader+weight+optimizer from bac…

9afa830

…kend for checkpoint saving/loading (pytorch#4429)


bobbyliujb force-pushed the export-D77604158 branch from 9dcd607 to 2940240 Compare

July 7, 2025 19:16

facebook-github-bot closed this in

f2e75f5

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Jul 8, 2025

This pull request has been merged in f2e75f5.

gchalump added category:new feature:gemm labels

Labels

category:new cla signed fb-exported feature:gemm Merged